Adaptive Speech Synthesis in a Cognitive Robotic Service Apartment: An Overview and First Steps Towards Voice Selection
Abstract
The Cognitive Robotic Service Apartment is both a realistic apartment and a laboratory environment in which one or several users interact with various manifestations of an intelligent agent, e.g. a talking head. We expect that, across the various situational settings in the apartment, different specifications and adaptations of the synthetic voice will become necessary. Some of these dynamic adaptations will depend on physical factors (e.g. ambient noise affecting speech intelligibility), others on interpersonal factors (e.g. familiarity), and still others on the manifestation of the artificial agent itself (e.g. the agent's voice, perceived gender, age and competence). The overall aim of our ongoing project is to build a voice that supports dynamic speech synthesis adaptation across various typical interaction scenarios and agent manifestations (robot, virtual agent). In the final implementation, the voice adaptation will be realized incrementally, i.e. the adaptation will take effect while the agent is talking. The adaptive synthesis module will extend the existing incremental speech processing system InproTK, which is part of the cognitive architecture of the apartment. In order to determine an ideal set of adaptive parameters, a series of experiments is currently being planned and carried out. This paper presents our general methodology and describes our first study, which aims to find suitable synthesis voices for the virtual agent or humanoid robot used in the Cognitive Robotic Service Apartment.
1. Modeling Adaptive Speech Synthesis in the CSRA
Modeling the interaction between humans and machines remains a major challenge for speech technology. This affects not only the interfaces between the different interacting system components (ASR, NLU, dialogue model, NLG, TTS) but also each component individually.
Our present project focuses on improving the speech synthesis component of an interactive system in general, and on the situation-specific adaptation and modification of the synthetic speech output in particular. Such adaptations of the voice, driven by communicative purposes, are natural in humans and necessary in machines that mimic human speech communication. The present paper outlines how dynamically adaptive synthetic speech is realized in an ongoing research project as part of a complex interaction environment called the Cognitive Robotic Service Apartment (CSRA).
1.1 Situation-specific adaptation in human speech production
An everyday example of situation-specific human speech adaptation is known as the Lombard Effect. Dialogues between humans quite often take place in noisy environments (outdoors in the presence of traffic noise; indoors with background music or with several people chatting simultaneously, e.g. in a pub). These conditions impede the intelligibility of the spoken content, owing both to limited transmission quality and to the speakers' limited ability to self-monitor their voices. E. Lombard was the first researcher to describe adaptation processes in speech produced under noisy conditions, and his findings initiated a large body of subsequent research in this area [9]. His main observation was that self-monitoring is the regulator between speech production and perception, and that a lack of self-monitoring leads to an involuntary adaptation to the environmental conditions, i.e. to Lombard Speech. Many studies have investigated the Lombard Effect from a medical or psychological perspective; more recently, it has also been investigated from acoustic, phonetic, linguistic and speech technological perspectives.
These studies showed that, compared to speech in a quiet environment, Lombard Speech exhibits a decreased speaking rate, an increased fundamental frequency (F0) and F0 range, a shift of intensity from low to high frequencies, an increased vowel duration and a shift of F1 and F2 [6, 10]. However, the identified differences depend both on the speaker and on the amount and type of ambient noise [7]. Lombard Speech adaptations occur spontaneously, immediately and unintentionally, and thus have a different cause than phonetically similar but intended adaptations, such as speech addressed to an inattentive listener, a distant listener, a poorly performing ASR system, a listener with hearing problems, or a listener unaware of a potential danger. So far, very little is known about Lombard Speech under real-life communicative conditions, as it has mostly been investigated in monologue reading tasks. Still, it can be safely assumed that human-human communication profits from Lombard Speech, as its adaptations serve to improve intelligibility [6, 11]. Therefore, although we cannot know precisely whether intended adaptations made for the sake of improved intelligibility resemble Lombard Speech in all its facets, we make this simplifying assumption in our ongoing study.
1.2 Adaptive interaction in the CSRA
Our project's interaction architecture is a Cognitive Robotic Service Apartment (CSRA). Unlike typical speech synthesis evaluations, this setting enables us to evaluate our adaptation strategies under both real-life and laboratory conditions. The former is possible because the human-apartment interactions are monitored permanently and across a wide range of everyday "university lab" situations such as demo tours, meetings or lunchtime chats, in the course of which individuals or groups interact with the interactive components both verbally and non-verbally.
The verbal interactions will use different manifestations of intelligent agents, such as a humanoid robot, a virtual agent or a disembodied apartment voice. Therefore, the agent's interaction strategy should suit various settings (information, service, interaction with a group, interaction with an individual, formal/informal settings) and their concrete manifestations (background music, quiet environment, attentive/inattentive user). We assume that the perceived interaction quality is influenced, at least to some extent, by the agent's overall voice quality and design, as these factors are associated with characteristics such as perceived competence, trustworthiness, dominance, anxiety, reliability or credibility. Therefore, in a first step, a set of suitable voices and designs that work across various types of artificial agents and situations needs to be determined. It is possible that the suitability of a voice is to some extent situation-dependent: a "competent" sounding voice might be more important in a formal situation where the agent explains something, while a "friendly, warm" voice might be more important in an informal situation where the agent welcomes the user.
1.3 Modeling adaptation in synthetic speech
In contrast to speech recognition systems [10, 8], the realization of Lombard Speech or similar types of environmental adaptation in speech synthesis is so far not well understood. This is somewhat surprising, as such adaptations can be expected to improve both intelligibility and perceived naturalness. Two potential adaptive strategies can be identified. One approach is to generate an artificial voice trained on a different speaking style, e.g. a Lombard Speech corpus recorded in a noisy environment. Such methods produce speaker-dependent synthetic voices and require a large amount of training data [15]. Another strategy lies in the modification of an existing 'neutrally speaking' voice.
Such adaptations are achieved by modifying extracted speech parameters such as F0, energy or spectral characteristics, followed by re-synthesis. One advantage of this solution is that no new training data are required. More importantly, such an adaptation can be performed dynamically, quickly and incrementally, without the need to switch to a different "voice". Such a dynamic, incremental type of adaptation to situational needs models the automaticity of the Lombard Effect in humans (cf. above) and may therefore contribute significantly to the perceived naturalness of the resulting interaction, as has previously been shown for other aspects of verbal interaction in human-machine dialogues (cf. below). In order to assess the intelligibility of the modified synthetic speech objectively, several solutions have been proposed in the literature, mainly based on models of the human auditory system (Glimpse Proportion, Dau model) and relying on the signal-to-noise ratio (SNR) [4].
1.4 Adaptation as part of incremental speech processing
In order to realize the situation-specific adaptations, the synthetic speech is generated within the speech-processing tool InproTK, which is part of the cognitive architecture of the CSRA [1]. InproTK includes a speech recognition module and a speech synthesis module, and it manages the speech input and output for the human-machine communication together with the dialog management tool Pamini [9]. Our speech adaptation module is based on the speech synthesis module, which uses a modified version of the MaryTTS synthesizer [13]. This modification of the internal data structures was necessary to support the incremental processes offered by InproTK. Incremental speech processing means that the system can react just in time to situational changes in speech, e.g. disfluencies, interruptions or other environmental changes, on the side of both speech recognition and speech synthesis. This is achieved by a step-by-step bottom-up process.
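As a minimal sketch of how a noise-dependent, step-by-step adaptation of this kind might be organized, the following Python example estimates the SNR before each chunk of an utterance and maps an SNR shortfall to Lombard-style prosody settings (raised F0, more energy, slower rate). It does not use the InproTK or MaryTTS APIs. All names (`snr_db`, `Prosody`, `lombard_prosody`, `plan_incremental_prosody`), the target SNR of 15 dB and the scaling constants are hypothetical and chosen for illustration only; they are not values from our project.

```python
import math
from dataclasses import dataclass


def snr_db(speech_rms: float, noise_rms: float) -> float:
    """Signal-to-noise ratio in dB, assuming positive RMS amplitudes."""
    return 20.0 * math.log10(speech_rms / noise_rms)


@dataclass
class Prosody:
    f0_scale: float        # multiplier applied to the fundamental frequency
    energy_gain_db: float  # gain added to the output level, in dB
    rate_scale: float      # multiplier on speaking rate (< 1 means slower)


def lombard_prosody(snr: float, target_snr: float = 15.0) -> Prosody:
    """Map an SNR shortfall to illustrative Lombard-style settings."""
    # strength is clipped to [0, 1]: 0 = quiet enough, 1 = shortfall >= 20 dB
    strength = min(max((target_snr - snr) / 20.0, 0.0), 1.0)
    return Prosody(
        f0_scale=1.0 + 0.2 * strength,     # hypothetical: up to +20 % F0
        energy_gain_db=6.0 * strength,     # hypothetical: up to +6 dB
        rate_scale=1.0 - 0.15 * strength,  # hypothetical: up to 15 % slower
    )


def plan_incremental_prosody(chunks, noise_rms_per_chunk, speech_rms=0.1):
    """Choose prosody settings chunk by chunk, re-reading the noise each time."""
    plan = []
    for chunk, noise_rms in zip(chunks, noise_rms_per_chunk):
        settings = lombard_prosody(snr_db(speech_rms, noise_rms))
        plan.append((chunk, settings))
    return plan
```

Calling `plan_incremental_prosody(["Hello", "world"], [0.001, 0.1])` leaves the first chunk unchanged and adapts only the second, for which the assumed noise level has risen. A non-incremental system would have to fix one setting for the whole utterance before synthesis starts.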
Each utterance is split into chunks (Incremental Units), which can be phonemes, words or entire phrases, before it is processed. This functionality is highly suitable for any type of adaptation because it allows prosodic properties of the speech, such as intensity or loudness, to be changed in the course of the synthesis process. Many conventional text-to-speech systems are based on the sequential processing of utterances. That is, the previous sentence is synthesized completely before the next sentence is processed. Such a traditional architecture allows adaptation only for a complete incoming utterance, not for its parts during ongoing synthesis. An incremental architecture allows for more flexibility. For example, an incremental Lombard adaptation may continuously modify the synthetic speech output in the presence of steadily increasing background noise. InproTK already includes first extensions for voice adaptation that handle prosodic changes such as pitch, duration, loudness and spectral shifts of a MaryTTS HMM voice. These manipulations are carried out at the phone level. Furthermore, a demonstrator exists that makes it possible to manipulate the pitch, duration and loudness of a complex sentence during the synthesis process [2].
1.5 First steps towards voice selection
As a first step towards realizing synthetic speech within the CSRA, the general suitability of a set of synthetic voices for typical interaction situations is evaluated. As we plan to carry out the situation-specific adaptations incrementally, only parametric synthesis voices can be used. To this end, three voices (two male, one female) already integrated in the MaryTTS synthesis architecture are part of this assessment. In addition, a newly created female voice is tested. The voice suitability is tested for two typical interaction scenarios (greeting, information) under both formal and informal conditions. The suitability is inferred from the perceived competence and warmth of the agent.
Finally, we want to find the best matches between existing voices and agent types (robot, virtual agent). The experiment and its main results are presented below.
2 Experimental Settings
We carried out a preliminary study to evaluate the general suitability of four different synthetic HMM voices (two male, two female) for verbal interaction between humans and a virtual agent or humanoid robot in typical apartment interactions. Eight native German participants (age range 30-55 years, four male and four female) took part in the experimental study. The evaluation software was implemented in Java, and the experiment was carried out on individual personal computers in the participants' homes or quiet offices. The participants used stereo headsets to listen to the experimental stimuli.
Publication date: 2015